Timeless is a TV show about a very strange assortment of people going on adventures with a time machine to save the world, or at least history as we know it. Even though, with a strong background in physics, I firmly believe in causality, I found myself enjoying the show and tuning in for more every Monday night. As the first season ended on the 20th of February and there is no word yet on the show being renewed or cancelled, this project is my tribute, meant to encourage the #RenewTimeless trend.
The figures and analyses below aim to demonstrate how main scenes and events of the TV show can be detected by observing tweet activity, how Timeless fans interact with each other via retweeting, where they tweet from and which hashtags and phrases they tend to use the most.
This project was conceived and carried out for the sole purpose of fun and practice. No great conclusions are meant to be discovered from the analysis below.
Coding aspects of the analysis are presented in light blue lettering, and the "I'd like to see the code too" button at the end of the page redirects to a full analysis with the Python scripts and snippets used to process the data, for those interested in data science.
Basic stats · Air times · Specific scenes · Character pairings · Retweet network · Geolocation · Hashtags & phrases · Media
Using the Twitter API, tweets with certain properties can be easily (and legally) streamed in real time. In this project, tweets with the hashtag #timeless were collected from Monday, February 20, 2017, 2:34 pm until Tuesday, February 21, 2017, 4:54 pm (ET).
In accordance with the Twitter Developer Agreement, actual usernames are omitted from this analysis; only aggregate information is presented.
The actual downloading was carried out with the tweepy Python package, via a Python script inspired by this blog post.
Altogether 52,041 tweets were posted in the above-mentioned time period, which amounts to an average of 33 tweets per minute. The tweets came from 8,996 different users, 102 of which are verified. Users posted an average of 5.785 tweets each. Verified users tweeted slightly more on average (6.314) than non-verified users (5.779), but the difference did not appear to be significant. 31,190 of all tweets were retweets, and 18,019 tweets had some type of media content (photos, videos or links).
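These aggregates require nothing beyond standard-library counting once the tweets are parsed. A minimal sketch of the idea, with entirely made-up tweets (only the field names — 'text', 'user.screen_name', 'user.verified' — follow the Twitter API), including the retweet test used later in the analysis:

```python
import re

# Made-up stand-ins for the streamed tweets; values are invented.
sample_tweets = [
    {'text': 'RT @castmember: tonight! #Timeless', 'user': {'screen_name': 'fan1', 'verified': False}},
    {'text': 'Best episode yet #Timeless', 'user': {'screen_name': 'fan2', 'verified': False}},
    {'text': 'Thank you for watching! #Timeless', 'user': {'screen_name': 'castmember', 'verified': True}},
]

# distinct users, verified users, and retweets (texts starting with "RT @...:")
users = {t['user']['screen_name'] for t in sample_tweets}
verified = {t['user']['screen_name'] for t in sample_tweets if t['user']['verified']}
retweets = [t for t in sample_tweets if re.match(r'^RT @.*:', t['text'])]
print(len(sample_tweets), len(users), len(verified), len(retweets))  # 3 3 1 1
```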
It is a reasonable assumption that tweet activity peaks around the air time of the TV show, especially because many of the cast members engage in live-tweeting. The figure below shows the number of tweets posted in a given minute. The EC (East Coast) air time is easily identifiable from the prominent peak starting at February 20, 9:00 pm and lasting about an hour. The West Coast (WC) was slightly less active, but it too left a detectable mark from February 21, 12:00 am until 1:00 am. (These are of course the WC equivalents of 9 pm to 10 pm.)
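The per-minute binning behind this figure (done below with pandas resampling) can also be mimicked with the standard library alone: truncate each timestamp to its minute and count. A minimal sketch with hypothetical timestamps:

```python
import collections
from datetime import datetime

# hypothetical tweet timestamps, already shifted to the broadcast timezone
times = [
    datetime(2017, 2, 20, 21, 0, 5),
    datetime(2017, 2, 20, 21, 0, 42),
    datetime(2017, 2, 20, 21, 1, 13),
]

# truncate each timestamp to the minute, then count tweets per bucket
per_minute = collections.Counter(t.replace(second=0, microsecond=0) for t in times)
print(per_minute[datetime(2017, 2, 20, 21, 0)])  # 2
```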
import warnings
warnings.filterwarnings("ignore")

import collections
import itertools
import json
import re
import string
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from matplotlib import cm
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
from plotly import tools

init_notebook_mode()
tweets = []
files = ['F:/U0/Orsi/ELTE/PhD/twitter/timeless1.json', 'F:/U0/Orsi/ELTE/PhD/twitter/timeless2.json']
for file in files:
    for line in open(file):
        try:
            tweets.append(json.loads(line))
        except ValueError:  # skip truncated or malformed JSON lines
            pass
def get_datetimes(date_list):
    # parse Twitter's created_at format, e.g. 'Mon Feb 20 19:34:00 +0000 2017'
    return [datetime.strptime(t, '%a %b %d %H:%M:%S +0000 %Y') for t in date_list]

def get_tweets_per_minute(datetime_list):
    # count the tweets falling into each one-minute bin
    idx = pd.DatetimeIndex(datetime_list)
    ones = [1] * len(datetime_list)
    db = pd.Series(ones, index=idx)
    per_min = db.resample('1T').sum().fillna(0)
    df = pd.DataFrame(per_min)
    df['time'] = df.index
    mins = list(df['time'])
    counts = list(df[0])
    return mins, counts
def append_annotation(annotations_list, time, text, ax=0, ay=-100, color='navy'):
    annotations_list.append(dict(
        x=time,
        y=0,
        xref='x',
        yref='y',
        text=text,
        showarrow=True,
        arrowhead=0,
        ax=ax,
        ay=ay,
        font=dict(
            color=color,
            size=12
        ),
        arrowcolor='navy',
    ))
    return annotations_list
def append_annotation_to_peak(annotations_list, time, times, counts, text, ax=-100, ay=-100, color='navy'):
    annotations_list.append(dict(
        x=time,
        y=counts[times.index(time)],
        xref='x',
        yref='y',
        text=text,
        showarrow=True,
        arrowhead=3,
        ax=ax,
        ay=ay,
        font=dict(
            color=color
        ),
        arrowcolor=color,
    ))
    return annotations_list
def highlight_timerange(shapes_list, start, end, annotations_list, text):
    shapes_list.append(dict(
        type='rect',
        xref='x',
        yref='paper',
        x0=start,
        y0=0,
        x1=end,
        y1=1,
        fillcolor='#d3d3d3',
        opacity=0.2,
        line=dict(width=0)))
    annotations_list.append(dict(
        x=start + (end - start) / 2,
        y=0.8,
        xref='x',
        yref='paper',
        text=text,
        showarrow=False,
        ax=0,
        ay=0,
        font=dict(
            color='#000000',
            size=12
        ),
    ))
    return shapes_list, annotations_list
def get_hashtags_with_fav(all_tweets_list=tweets):
    if isinstance(all_tweets_list, list):
        hashtagslist = [tweet['entities']['hashtags'] for tweet in all_tweets_list if tweet['entities']['hashtags'] != []]
        favlist = [tweet['favorite_count'] for tweet in all_tweets_list if tweet['entities']['hashtags'] != []]
        return [[k['text'] for k in j] for j in hashtagslist], favlist
    else:
        return [k['text'] for k in all_tweets_list['entities']['hashtags']], all_tweets_list['favorite_count']
t_text = [t['text'] for t in tweets]
t_time = [t['created_at'] for t in tweets]
t_hashtags_dict = [t['entities']['hashtags'] for t in tweets]
t_mentions_dict = [t['entities']['user_mentions'] for t in tweets]
t_favs = [t['favorite_count'] for t in tweets]
t_replyto = [t['in_reply_to_screen_name'] for t in tweets]
t_lang = [t['lang'] for t in tweets]
t_username = [t['user']['screen_name'] for t in tweets]
t_userver = [t['user']['verified'] for t in tweets]
t_isretweet = [re.match("^RT @.*:", t) is not None for t in t_text]
t_retweetuser = [t.split(':')[0].split('RT @')[1] if rt else None for t, rt in zip(t_text, t_isretweet)]
times_dt = get_datetimes(t_time)
times_real = [t-timedelta(hours=6) for t in times_dt]
mins, counts = get_tweets_per_minute(times_real)
data = [Bar(
    x=mins,
    y=counts, name='tweets per minute',
    marker=dict(
        color='lightsteelblue',
        line=dict(
            width=0,
        ))
)]
annotations = []
shapes = []
highlight_timerange(shapes_list=shapes, start=datetime(2017, 2, 20, 21, 0, 0), end=datetime(2017, 2, 20, 22, 0, 0),
                    annotations_list=annotations, text='East coast')
highlight_timerange(shapes_list=shapes, start=datetime(2017, 2, 21, 0, 0, 0), end=datetime(2017, 2, 21, 1, 0, 0),
                    annotations_list=annotations, text='West coast')
layout = Layout(title='Total number of tweets per minute', yaxis=dict(title='number of tweets'), annotations=annotations,
                shapes=shapes,
                xaxis=dict(range=[datetime(2017, 2, 20, 18, 0, 0), datetime(2017, 2, 21, 4, 0, 0)]))
fig = Figure(data=data, layout=layout)
iplot(fig)
Seeing that identifying air times is pretty straightforward from the number of tweets, a somewhat more ambitious task is to actually detect different scenes of the episode from the tweet activity of fans. At this point, watching the episode becomes absolutely necessary, as it provides a plausible starting point for further analysis.
The two main scenes of the show that would most likely cause an increased tweet activity from the fans are the ones of Lucy and Wyatt saying goodbye in 1954 and the realisation that Lucy's mother is a member of Rittenhouse.
To filter tweets based on their relevance, for the Lucy & Wyatt scene only tweets containing the lyatt or wucy phrases were kept, both of these referring to the pairing of the show's two main characters.
For the last scene of the episode with Lucy's mother, tweets were filtered for the phrases mom and mother.
The figure below illustrates that both these events are detectable by monitoring the activity of relevant tweets, and that the final scene of the show caused a general uproar on Twitter. However, a (naively) unexpected third peak also appears for the Lucy & Wyatt tweets, right before the last scene of Lucy and her mother. This corresponds to the scene in which Mason surprises Lucy & Wyatt in the act of not-yet-but-almost-kissing. (The amount of tweet activity suggests that fans were less than happy about the interruption.)
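The filtering itself is just case-insensitive substring matching. A minimal sketch (the tweet texts here are invented):

```python
def matches_any(text, phrases):
    # case-insensitive substring match against a list of phrases
    low = text.lower()
    return any(p in low for p in phrases)

# made-up tweet texts for illustration
sample_texts = [
    'That #Lyatt goodbye destroyed me #Timeless',
    "Lucy's mother?! No way. #Timeless",
    'Rufus saves the day again #Timeless',
]

lyatt_tweets = [t for t in sample_texts if matches_any(t, ['lyatt', 'wucy'])]
mom_tweets = [t for t in sample_texts if matches_any(t, ['mom', 'mother'])]
print(len(lyatt_tweets), len(mom_tweets))  # 1 1
```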
times_wucy = get_datetimes(np.array(t_time)[np.array([('#wucy' in t.lower() or '#lyatt' in t.lower()) for t in t_text])])
times_mom = get_datetimes(np.array(t_time)[np.array([('mom' in t.lower() or 'mother' in t.lower()) for t in t_text])])
times_real_wucy = [t-timedelta(hours=6) for t in times_wucy]
times_real_mom = [t-timedelta(hours=6) for t in times_mom]
mins_wucy, counts_wucy = get_tweets_per_minute(times_real_wucy)
mins_mom, counts_mom = get_tweets_per_minute(times_real_mom)
trace_wucy = Bar(
    x=mins_wucy,
    y=counts_wucy, name='phrases: "wucy" or "lyatt"',
    marker=dict(
        color='PLUM',
        line=dict(
            width=0,
        )))
trace_mom = Bar(
    x=mins_mom,
    y=counts_mom, name='phrases: "mother" or "mom"',
    marker=dict(
        color='MEDIUMTURQUOISE',
        line=dict(
            width=0,
        )))
data = [trace_wucy, trace_mom]
annotations = []
append_annotation_to_peak(annotations_list=annotations, time=datetime(2017, 2, 20, 21, 42, 0), times=mins_wucy, counts=counts_wucy,
                          text='"I cannot lose you again."<br>(East coast)', color='SALMON', ax=-50, ay=-50)
append_annotation_to_peak(annotations_list=annotations, time=datetime(2017, 2, 21, 0, 43, 0), times=mins_wucy, counts=counts_wucy,
                          text='"I cannot lose you again."<br>(West coast)', color='SALMON', ax=-70, ay=-50)
append_annotation_to_peak(annotations_list=annotations, time=datetime(2017, 2, 20, 21, 59, 0), times=mins_mom, counts=counts_mom,
                          text="Lucy's mom in Rittenhouse<br>(East coast)", color='SALMON', ax=50, ay=-50)
append_annotation_to_peak(annotations_list=annotations, time=datetime(2017, 2, 21, 0, 58, 0), times=mins_mom, counts=counts_mom,
                          text="Lucy's mom in Rittenhouse<br>(West coast)", color='SALMON', ax=50, ay=-50)
append_annotation_to_peak(annotations_list=annotations, time=datetime(2017, 2, 20, 21, 56, 0), times=mins_wucy, counts=counts_wucy,
                          text='"Maybe I do need to be open to the possibilities."<br>(East coast)', color='SALMON', ax=100, ay=-40)
append_annotation_to_peak(annotations_list=annotations, time=datetime(2017, 2, 21, 0, 55, 0), times=mins_wucy, counts=counts_wucy,
                          text='"Maybe I do need to be open to the possibilities."<br>(West coast)', color='SALMON', ax=-50, ay=-120)
shapes = []
highlight_timerange(shapes_list=shapes, start=datetime(2017, 2, 20, 21, 0, 0), end=datetime(2017, 2, 20, 22, 0, 0),
                    annotations_list=annotations, text='East coast')
highlight_timerange(shapes_list=shapes, start=datetime(2017, 2, 21, 0, 0, 0), end=datetime(2017, 2, 21, 1, 0, 0),
                    annotations_list=annotations, text='West coast')
layout = Layout(title='Number of phrase-specific tweets per minute',
                xaxis=dict(range=[datetime(2017, 2, 20, 20, 30, 0), datetime(2017, 2, 21, 1, 30, 0)]),
                yaxis=dict(title='number of tweets'),
                annotations=annotations,
                shapes=shapes)
fig = Figure(data=data, layout=layout)
iplot(fig)
Given that Lucy and Wyatt seem to be mentioned together frequently in tweets, it could be interesting to see how often other pairs of characters appear together.
For this, the five main characters of the show (Lucy, Wyatt, Rufus, Flynn and Jiya) were considered and the number of tweets determined for every possible pairing.
As both Lucy & Wyatt and Rufus & Jiya are romantic pairings and have commonly used abbreviations, the phrases lyatt, wucy, jifus and riya were also considered alongside the actual names of the characters.
The heatmap below shows how likely it is for the main characters to appear in different pairs. Not unexpectedly, Lucy & Wyatt is the most mentioned pairing, but Lucy generally (paired with any of the characters) tends to appear in more tweets than anyone else. The pairing of Rufus & Jiya is reasonably popular as well.
chars = ['lucy', 'wyatt', 'rufus', 'jiya', 'flynn']
char_heatmap = np.zeros((len(chars), len(chars)))
for charpair in itertools.combinations(chars, 2):
    db = sum([(charpair[0] in t.lower() and charpair[1] in t.lower()) for t in t_text])
    if ('lucy' in charpair and 'wyatt' in charpair):
        db += sum([('#wucy' in t.lower() or '#lyatt' in t.lower()) for t in t_text])
    if ('jiya' in charpair and 'rufus' in charpair):
        db += sum([('#riya' in t.lower() or '#jifus' in t.lower()) for t in t_text])
    char_heatmap[chars.index(charpair[0])][chars.index(charpair[1])] = db
    char_heatmap[chars.index(charpair[1])][chars.index(charpair[0])] = db
colorscale_spec = []
inv = True
for i in range(10, 121):
    if inv:
        colorscale_spec.append([float(i-10)/110, 'rgb' + ', '.join(str(cm.magma(255-i)).split(',')[:3]) + ')'])
    else:
        colorscale_spec.append([float(i-10)/110, 'rgb' + ', '.join(str(cm.magma(i)).split(',')[:3]) + ')'])
annotations = []
for n, row in enumerate(char_heatmap):
    for m, val in enumerate(row):
        if val > 0:
            annotations.append(dict(
                text=str(int(val)),
                x=chars[m], y=chars[n],
                xref='x1', yref='y1',
                font=dict(color='black' if val < 2500 else 'white', size=10),
                showarrow=False))
trace = Heatmap(x=chars, y=chars, z=char_heatmap, colorscale=colorscale_spec, showscale=True, colorbar=ColorBar(x=1,ticklen=0,thickness=10))
fig = Figure(data=[trace])
fig['layout'].update(
    title="Characters mentioned together",
    annotations=annotations,
    yaxis=dict(ticks='', ticksuffix=' '),
    xaxis=dict(ticks='', ticksuffix=' '),
    width=500,
    height=500,
    autosize=False
)
iplot(fig)
As many of the tweets were retweets, it could be interesting to see how retweets connect the Timeless community. To illustrate this, a network of users was created, the nodes of which represent Twitter users who posted at least 5 #Timeless tweets in the investigated time period. This restriction is necessary to keep the network at a manageable size. Users (nodes of the network) are connected with an arrow (called a 'directed edge') if one of them retweeted a tweet of the other user. The arrow starts at the user doing the retweeting and points to the user whose tweet is being retweeted.
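Stripped of the plotting, the bookkeeping behind such a network is simple: each retweet contributes one directed edge, and a user's number of incoming arrows is their in-degree. A toy sketch with invented usernames:

```python
import collections

# (retweeter, original_author) pairs, one per retweet; usernames are made up
retweet_edges = [
    ('fanA', 'castmember'),
    ('fanB', 'castmember'),
    ('fanB', 'fanA'),
]

# in-degree: the number of retweet arrows pointing at each user
in_degree = collections.Counter(original for _, original in retweet_edges)
print(in_degree['castmember'], in_degree['fanA'])  # 2 1
```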
On the figure below, incoming arrows for each user are represented with a pinkish color and outgoing arrows with grey. Thus users with many pink lines are the ones whose tweets are retweeted many times. The coloring of each node also illustrates the number of incoming arrows, the darker the color of the node, the more their tweets are retweeted.
Actual usernames are not presented to comply with the Twitter Developer Agreement. The shapes of the nodes, however, differentiate between verified users (marked with stars) and non-verified users (marked with circles). (Many of the show's cast are verified users, but not all of them.) It is apparent that the tweets of verified users are generally more likely to be retweeted. (Stars generally have a darker color than circles.)
From calculating the clustering coefficients for all nodes, it appears that non-verified users have a much higher average clustering (0.41) than verified users (0.15). This suggests that non-verified users tend to get retweeted by a close group of friends, where each group member is likely to retweet many of the other members' tweets. Verified users, on the other hand, get retweets from essentially random users who do not necessarily know or retweet each other; their only connection is the verified user they all like and retweet.
The diameter of the largest connected component of the whole graph is 5. This means that any two users in this component are connected by a path of at most 5 retweet links (i.e., via no more than 4 intermediate users). This is a telltale sign of a network having "small-world properties".
The power-law behaviour of the degree distribution also suggests that the network is scale-free, similarly to most real-life social networks.
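For reference, the local clustering coefficient of a node measures the fraction of its neighbour pairs that are themselves connected, which is what drives the verified vs. non-verified contrast above. A hand-rolled version on a toy undirected graph (equivalent in spirit to what networkx computes):

```python
import itertools

def clustering(adj, node):
    # local clustering coefficient on an undirected graph given as
    # an adjacency dict {node: set(neighbours)}
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in itertools.combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

# a triangle a-b-c plus a pendant node d attached to a
adj = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c'},
    'c': {'a', 'b'},
    'd': {'a'},
}
print(clustering(adj, 'a'))  # 1 connected pair out of 3, i.e. 1/3
```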
min_tweet_count = 5
counter = collections.Counter(t_username)
all_relevant_users = []
for user, tweetcount in counter.items():
    if tweetcount >= min_tweet_count:
        all_relevant_users.append(user)
graph = nx.DiGraph()
user_ver_dict = dict()
for i in range(len(t_username)):
    user_ver_dict[t_username[i]] = t_userver[i]
graph.add_nodes_from(all_relevant_users)
for orig_user, rt_user in zip(t_username, t_retweetuser):
    if (rt_user is not None and orig_user in all_relevant_users and rt_user in all_relevant_users):
        graph.add_edge(orig_user, rt_user)
pos = nx.spring_layout(graph)
nx.set_node_attributes(graph, 'pos', pos)
dmin = 1
ncenter = 0
for n in pos:
    x, y = pos[n]
    d = (x - 0.5)**2 + (y - 0.5)**2
    if d < dmin:
        ncenter = n
        dmin = d
p = nx.single_source_shortest_path_length(graph, ncenter)
magnitudes = 8
color_codes = plt.get_cmap('YlGnBu')
color_codes_list = [color_codes(i) for i in range(color_codes.N)]
color_codes_plain = np.array(color_codes_list[::int(len(color_codes_list)/(magnitudes+1))])
base = [1 - 1/3**k for k in range(int(magnitudes+1))]
base = base[::-1]
base = [1] + base
cs = [[float(i), 'rgb(' + ', '.join([str(k) for k in j[:3]]) + ')'] for (i, j) in zip(base, color_codes_plain)]
cs = cs[::-1]
edge_trace = Scatter(
    x=[],
    y=[],
    line=Line(width=0.5, color='#E0E0E0'),
    hoverinfo='none',
    mode='lines')
for edge in graph.edges():
    x0, y0 = graph.node[edge[0]]['pos']
    x1, y1 = graph.node[edge[1]]['pos']
    edge_trace['x'] += [x0, x1, None]
    edge_trace['y'] += [y0, y1, None]
edgeend_trace = Scatter(
    x=[],
    y=[],
    line=Line(width=1, color='lightpink'),
    hoverinfo='none',
    mode='lines')
for edge in graph.edges():
    x0, y0 = graph.node[edge[0]]['pos']
    x1, y1 = graph.node[edge[1]]['pos']
    edgeend_trace['x'] += [x0 + (x1 - x0) * 0.9, x1, None]
    edgeend_trace['y'] += [y0 + (y1 - y0) * 0.9, y1, None]
node_trace_nonver = Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=Marker(
        showscale=False,
        colorscale=cs,
        cmin=min(list(graph.degree().values())),
        cmax=max(list(graph.degree().values())),
        reversescale=True,
        color=[],
        size=5,
        colorbar=dict(
            thickness=15,
            title='Node Connections',
            xanchor='left',
            titleside='right'
        ),
        line=dict(width=0.5)))
node_trace_ver = Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=Marker(
        symbol='star',
        colorscale=cs,
        showscale=False,
        cmin=min(list(graph.degree().values())),
        cmax=max(list(graph.degree().values())),
        reversescale=True,
        color=[],
        size=15,
        line=dict(width=0.5)))
for node in graph.nodes():
    x, y = graph.node[node]['pos']
    if user_ver_dict[node]:
        node_trace_ver['x'].append(x)
        node_trace_ver['y'].append(y)
        node_trace_ver['marker']['color'].append(graph.in_degree()[node])
        node_trace_ver['text'].append('# of retweets: ' + str(graph.in_degree()[node]))
    else:
        node_trace_nonver['x'].append(x)
        node_trace_nonver['y'].append(y)
        node_trace_nonver['marker']['color'].append(graph.in_degree()[node])
        node_trace_nonver['text'].append('# of retweets: ' + str(graph.in_degree()[node]))
fig = Figure(data=Data([edge_trace, edgeend_trace, node_trace_nonver, node_trace_ver]),
             layout=Layout(
                 height=900,
                 width=900,
                 title='Network of users based on retweets',
                 titlefont=dict(size=16),
                 showlegend=False,
                 hovermode='closest',
                 margin=dict(b=20, l=5, r=5, t=40),
                 xaxis=XAxis(showgrid=False, zeroline=False, showticklabels=False),
                 yaxis=YAxis(showgrid=False, zeroline=False, showticklabels=False)))
iplot(fig)
Usually only very few users share their geographic location when posting a tweet; in the current set of Timeless tweets, this number is 17. Thus any conclusion drawn from the figure below is somewhat unreliable. The main reason for including it nevertheless is that it plainly illustrates that Timeless has fans all around the world, apparently even in the middle of the Pacific Ocean. (Clearly, dolphins are as thirsty for historical knowledge as the best of us.)
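One pitfall worth flagging here: in the Twitter API the (since deprecated) `geo` field stores coordinates as [latitude, longitude], while the GeoJSON-style `coordinates` field uses [longitude, latitude]. A small helper for the `geo` case (the tweet dict below is made up):

```python
def latlon_from_geo(tweet):
    # extract (lat, lon) from a tweet's 'geo' field, which stores
    # coordinates as [latitude, longitude]; returns None if unset
    geo = tweet.get('geo')
    if geo is None:
        return None
    lat, lon = geo['coordinates']
    return lat, lon

# made-up tweet located roughly at New York City
tweet = {'geo': {'type': 'Point', 'coordinates': [40.71, -74.0]}}
print(latlon_from_geo(tweet))  # (40.71, -74.0)
```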
t_geo = [t['geo'] for t in tweets]
notnone = [x is not None for x in t_geo]
# the 'geo' field stores coordinates as [latitude, longitude]
lat = [t['coordinates'][0] for t in np.array(t_geo)[notnone]]
lon = [t['coordinates'][1] for t in np.array(t_geo)[notnone]]
data = [dict(
    type='scattergeo',
    lon=lon,
    lat=lat,
    mode='markers',
    marker=dict(
        size=8,
        opacity=0.8,
        symbol='circle',
        line=dict(
            width=1,
            color='rgb(102, 102, 102)'
        ),
    ))]
layout = dict(
    title='Location of Timeless tweets',
    geo=dict(
        scope='world',
        showland=True,
        landcolor="rgb(250, 250, 250)",
        subunitcolor="rgb(217, 217, 217)",
        countrycolor="rgb(217, 217, 217)",
        countrywidth=0.5,
        subunitwidth=0.5
    ),
)
fig = dict(data=data, layout=layout)
iplot(fig)
To get a basic idea of what all these tweets are actually about, the most popular hashtags and phrases were collected. As all tweets contained the #Timeless hashtag to begin with, it was excluded from further analysis.
The overwhelming dominance of the #RenewTimeless hashtag and the most popular phrases ("oh", "can" we "get" another "season", "please") all indicate that fans are fairly interested in the show's renewal for a second season.
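Counting hashtags boils down to flattening the `entities.hashtags` lists, lower-casing, and dropping #timeless itself. A self-contained sketch with invented tweets (only the `entities.hashtags` structure follows the Twitter API):

```python
import collections

sample_tweets = [
    {'entities': {'hashtags': [{'text': 'Timeless'}, {'text': 'RenewTimeless'}]}},
    {'entities': {'hashtags': [{'text': 'renewtimeless'}]}},
    {'entities': {'hashtags': []}},
]

# flatten, lower-case, and skip #timeless, which every tweet carries by construction
counts = collections.Counter(
    h['text'].lower()
    for t in sample_tweets
    for h in t['entities']['hashtags']
    if h['text'].lower() != 'timeless'
)
print(counts['renewtimeless'])  # 2
```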
hashtags, ht_favs = get_hashtags_with_fav()
ht_dict = dict()
for htl, f in zip(hashtags, ht_favs):
    for ht in htl:
        rht = ht.lower()
        if rht != 'timeless':
            if rht in ht_dict:
                ht_dict[rht].append(f)
            else:
                ht_dict[rht] = [f]
ht_list_all = []
ht_num_all = []
ht_mean_favs_all = []
for ht, l in ht_dict.items():
    ht_list_all.append(ht)
    ht_num_all.append(len(l))
    ht_mean_favs_all.append(np.mean(l))
x = 1
numtoplot = 21
new_ht_list_all = np.array(ht_list_all)[np.array(ht_num_all) > x]
new_ht_mean_favs_all = np.array(ht_mean_favs_all)[np.array(ht_num_all) > x]
new_ht_num_all = np.array(ht_num_all)[np.array(ht_num_all) > x]
new_ht_list_all_withtags = ['#'+k for k in new_ht_list_all]
new_ht_list_all_withtags = np.array(new_ht_list_all_withtags)
trace_num = Bar(y=new_ht_list_all_withtags[new_ht_num_all.argsort()[-1*numtoplot:]],
                x=new_ht_num_all[new_ht_num_all.argsort()[-1*numtoplot:]], orientation='h',
                marker=dict(color='lightblue'))
data = [trace_num]
layout = Layout(
    title='Most used hashtags',
    height=500, width=800, yaxis=dict(tickfont=dict(size=10)),
    showlegend=False,
    xaxis=dict(title='number of tweets',
               showgrid=True),
    margin=dict(b=100, l=100, r=100, t=60))
fig = Figure(data=data, layout=layout)
iplot(fig)
emoticons_str = r"""
    (?:
        [:=;]              # Eyes
        [oO\-]?            # Nose (optional)
        [D\)\]\(\]/\\OpP]  # Mouth
    )"""
regex_str = [
    emoticons_str,
    r'<[^>]+>',  # HTML tags
    r'(?:@[\w_]+)',  # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words with - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\S)'  # anything else
]
tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
tweet_texts = [preprocess(t) for t in t_text]
punctuation = list(string.punctuation)
sw = []
for line in open('stopwords_list.txt'):
    try:
        sw.append(str(line).strip('\n'))
    except:
        pass
stop = sw + punctuation + ['rt', 'via', '', '…', 'amp', '’', '\U0001f3fc', '\U0001f3fb', '️', '・', '•', '”', '“']
countall = collections.Counter()
for tweet_text in tweet_texts:
    real_tokens = [token.lower() for token in tweet_text if token.lower() not in stop and not token.lower().startswith(('#', '@', '\\U')) and len(token) > 1]
    countall.update(real_tokens)
most_common_dict = dict()
for pair in countall.most_common(100):
    if pair[1] >= 50:
        most_common_dict[pair[0]] = pair[1]
term_list_final = []
count_list_final = []
for term, count in most_common_dict.items():
    term_list_final.append(term)
    count_list_final.append(count)
term_list_final = np.array(term_list_final)
count_list_final = np.array(count_list_final)
trace_num = Bar(y=term_list_final[count_list_final.argsort()][-1*numtoplot:],
                x=count_list_final[count_list_final.argsort()][-1*numtoplot:], orientation='h',
                marker=dict(color='lightblue'))
data = [trace_num]
layout = Layout(
    title='Most used phrases',
    height=500, width=800, yaxis=dict(tickfont=dict(size=10)),
    showlegend=False,
    xaxis=dict(title='number of tweets',
               showgrid=True),
    margin=dict(b=100, l=100, r=100, t=60))
fig = Figure(data=data, layout=layout)
iplot(fig)
Photos (click to see full size at the original URL):
Links:
import IPython.core.display as di
# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); \
jQuery(".prompt").toggle();}});</script>', raw=True)